Total Energy Prediction - Neural Network

Introduction

In this notebook we will machine-learn the relationship between a molecular descriptor and total energy using neural networks.

The energy of ~134k molecules was calculated at the CCSD level.

Setup


In [ ]:
# --- INITIAL DEFINITIONS ---
from sklearn.neural_network import MLPRegressor
import numpy, math, random
import matplotlib.pyplot as plt
from scipy.sparse import load_npz
from mpl_toolkits.mplot3d import Axes3D

Let's pick a descriptor. Allowed types are:

  1. cnt: atom counts
  2. bob: bag of bonds
  3. soap: smooth overlap of atomic positions; choose from:
    1. soap.sum - all atoms summed together
    2. soap.mean - mean of all atom SOAP
    3. soap.centre - computed at the central point
  4. mbtr: many-body tensor representation
  5. cm: Coulomb matrix

In [ ]:
# TYPE is the descriptor type
TYPE = "cm"

# show descriptor details
print("\nDescriptor details")
with open("./data/descriptor." + TYPE.split('.')[0] + ".txt", "r") as f:
    for line in f:
        print(line.strip())
print(" ")

Now we load the databases with the descriptors (input) and the correct total energies (output). The databases are quite big, so we can decide how many samples to use for training.


In [ ]:
# load input/output data
trainIn = load_npz("./data/energy.input."+TYPE+".npz").toarray()
trainOut = numpy.load("./data/energy.output.npy")
trainIn = trainIn.astype(dtype=numpy.float64, casting='safe')

# decide how many samples to take from the database
samples  = min(trainIn.shape[0], 9000)
vsamples = min(trainIn.shape[0]-samples,1000)
print("training samples:   "+str(samples))
print("validation samples: "+str(vsamples))
print("number of features: {}".format(trainIn.shape[1]))

# split between training and validation
validIn = trainIn[samples:samples+vsamples]
validOut = trainOut[samples:samples+vsamples]

trainIn  = trainIn[0:samples]
trainOut = trainOut[0:samples]

# shift and scale the inputs using the training-set statistics
train_mean = numpy.mean(trainIn, axis=0)
train_std = numpy.std(trainIn, axis=0)
train_std[train_std==0] = 1   # avoid division by zero for constant features

trainIn = (trainIn - train_mean) / train_std
# apply the same shift and scale to the validation set
validIn = (validIn - train_mean) / train_std


# show the first few descriptors
print("\nDescriptors for the first 5 molecules:")
print(trainIn[0:5])

Next we set up a multilayer perceptron of suitable size. Our package of choice is scikit-learn, but more efficient ones are available.
Check the scikit-learn documentation for a list of parameters.


In [ ]:
# setup the neural network
nn = MLPRegressor(hidden_layer_sizes=(1000,200,50,50),  activation='tanh', solver='lbfgs', alpha=0.01, 
                  learning_rate='adaptive')
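
For a quick overview of all the hyperparameters the regressor exposes (the scikit-learn documentation describes each of them in detail), get_params() returns them as a dictionary:


In [ ]:
# list the regressor's hyperparameters and their current values
for name, value in sorted(nn.get_params().items()):
    print("{}: {}".format(name, value))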

Training

Now comes the tough part! The idea of training is to evaluate the ANN on the training inputs and measure its error (since we know the correct outputs). It is then possible to compute the derivative (gradient) of the error w.r.t. each parameter (connection weights and biases). By shifting the parameters in the direction opposite to the gradient, we obtain a better set of parameters that should give a smaller error. This procedure is repeated until the error is minimised.
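
To make this concrete, here is a minimal, purely illustrative sketch of a single gradient-descent step for a toy linear model with a mean-squared-error loss. This is not the code used for the NN below; scikit-learn performs the optimisation internally.


In [ ]:
# illustrative only: one gradient-descent step for a linear model y ~ X.w
X = numpy.random.rand(10, 3)          # toy inputs: 10 samples, 3 features
y = numpy.random.rand(10)             # toy targets
w = numpy.zeros(3)                    # model parameters
lr = 0.1                              # learning rate

error = X.dot(w) - y                  # prediction errors
grad = 2.0 * X.T.dot(error) / len(y)  # gradient of the MSE w.r.t. w
w -= lr * grad                        # step against the gradient
print("MSE after one step:", numpy.mean((X.dot(w) - y)**2))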

It may take a while...


In [ ]:
# use this to change some parameters during training if the NN gets stuck in a bad spot
nn.set_params(solver='lbfgs')

nn.fit(trainIn, trainOut);

Check the quality of the ANN with a regression plot, which shows the mismatch between the exact outputs and the NN predictions for the validation set.


In [ ]:
# evaluate the training and validation error
trainMLOut = nn.predict(trainIn)
validMLOut = nn.predict(validIn)

print ("Mean Abs Error (training)  : ", (numpy.abs(trainMLOut-trainOut)).mean())
print ("Mean Abs Error (validation): ", (numpy.abs(validMLOut-validOut)).mean())

plt.plot(validOut,validMLOut,'o')
plt.plot([min(validOut),max(validOut)],[min(validOut),max(validOut)]) # perfect fit line
plt.xlabel('correct output')
plt.ylabel('NN output')
plt.show()

# error histogram
plt.hist(validMLOut-validOut,50)
plt.xlabel("Error")
plt.ylabel("Occurrences")
plt.show()
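
Besides the mean absolute error, MLPRegressor also provides a score() method that returns the coefficient of determination R² on a given dataset, which can serve as an additional summary of the fit quality:


In [ ]:
# R^2 (coefficient of determination) on the training and validation sets
print("R2 (training)  :", nn.score(trainIn, trainOut))
print("R2 (validation):", nn.score(validIn, validOut))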

Exercises

1. Compare descriptors

Keeping the size of the NN constant, test the accuracy of the different descriptors and find the best one for these systems.


In [ ]:
# DIY code here...
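
One possible starting point is sketched below: it simply repeats the loading, scaling, and training steps from above for each descriptor type and compares the validation mean absolute errors. Treat it as a sketch; the paths, sample counts, and network size are taken from the cells above and can be changed.


In [ ]:
# sketch: compare descriptors with the same network architecture
results = {}
for descriptor in ["cnt", "bob", "soap.sum", "soap.mean", "soap.centre", "mbtr", "cm"]:
    X = load_npz("./data/energy.input." + descriptor + ".npz").toarray().astype(numpy.float64)
    y = numpy.load("./data/energy.output.npy")

    n_train = min(X.shape[0], 9000)
    n_valid = min(X.shape[0] - n_train, 1000)
    X_train, y_train = X[:n_train], y[:n_train]
    X_valid, y_valid = X[n_train:n_train + n_valid], y[n_train:n_train + n_valid]

    # shift and scale with the training-set statistics
    mean, std = X_train.mean(axis=0), X_train.std(axis=0)
    std[std == 0] = 1
    X_train = (X_train - mean) / std
    X_valid = (X_valid - mean) / std

    model = MLPRegressor(hidden_layer_sizes=(1000, 200, 50, 50), activation='tanh',
                         solver='lbfgs', alpha=0.01)
    model.fit(X_train, y_train)
    results[descriptor] = numpy.abs(model.predict(X_valid) - y_valid).mean()
    print("{}: validation MAE = {}".format(descriptor, results[descriptor]))

print("Best descriptor:", min(results, key=results.get))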

2. Combine with Principal Component Analysis - Advanced

Reduce the descriptor size with PCA (check the PCA.ipynb notebook) and train again. Can you get similar accuracy with much smaller networks?


In [ ]:
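
A possible starting point is sketched below, assuming scikit-learn's PCA and the already-scaled trainIn/validIn arrays from above; the number of components and the smaller network size are arbitrary choices to experiment with.


In [ ]:
# sketch: compress the descriptor with PCA, then train a smaller network
from sklearn.decomposition import PCA

pca = PCA(n_components=50)                 # try different numbers of components
trainIn_pca = pca.fit_transform(trainIn)   # fit the PCA on the training descriptors
validIn_pca = pca.transform(validIn)       # apply the same projection to the validation set
print("explained variance kept:", pca.explained_variance_ratio_.sum())

small_nn = MLPRegressor(hidden_layer_sizes=(200, 50), activation='tanh',
                        solver='lbfgs', alpha=0.01)
small_nn.fit(trainIn_pca, trainOut)

print("Mean Abs Error (validation):", numpy.abs(small_nn.predict(validIn_pca) - validOut).mean())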